---
id: tutorial-dataset-confident
title: Using Datasets for Evaluation
sidebar_label: Using Datasets for Evaluation
---

To **start using your datasets for evaluation**, you’ll need to:

1. **Pull your dataset** from Confident AI.
2. Compute the actual outputs and retrieval contexts, and **convert your goldens into test cases**.
3. Begin running **evaluations**.

In this tutorial, we’ll pull the synthetic dataset generated in the previous section and run evaluations on the dataset against the three metrics we’ve defined: Answer Relevancy, Faithfulness, and Professionalism.

## Pulling Your Dataset

To pull a dataset from Confident AI, simply call the `pull` method from an `EvaluationDataset` and specify the dataset alias you wish to retrieve. By default, `auto_convert_goldens_to_test_cases` is set to `True`, but we'll set it to `False` for this tutorial since the actual output is a required parameter in an `LLMTestCase`, and we haven't generated them yet.

```python
from deepeval import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="Patients Seeking Diagnosis", auto_convert_goldens_to_test_cases=False)
```

## Converting Goldens to Test Cases

Next, we'll convert the goldens in the dataset we pulled into `LLMTestCase`s and add them to our evaluation dataset. Although our goldens have contexts and expected outputs, we won’t need them for our current set of metrics.

```python
from deepeval.test_case import LLMTestCase

for golden in dataset.goldens:
    # Compute actual output and retrieval context
    actual_output = "..."  # Replace with logic to compute actual output
    retrieval_context = "..."  # Replace with logic to compute retrieval context

    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context
        )
    )
```

## Evaluating Your Dataset

Finally, we'll redefine our three metrics and use the `evaluate` function to run evaluations on our synthetic dataset.

```python
from deepeval import evaluate

...
evaluate(
  dataset,
  metrics = [answer_relevancy_metric, faithfulness_metric, professionalism_metric],
  hyperparameters={"model": MODEL, "prompt template": SYSTEM_PROMPT, "temperature": TEMPERATURE}
)
```

Here are the final evaluation results:

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_datasets_07.png"
    alt="Datasets 1"
    style={{
      height: "auto",
      maxHeight: "800px",
      marginBottom: "20px",
    }}
  />
</div>

You can see that although we passed all 5 test cases previously, it's important to test on a larger dataset, as 4 out of the 15 test cases we generated are still failing. To learn more about iterating on your hyperparameters, you can [revisit this section](/tutorials/tutorial-evaluations-hyperparameters).
